Regression?
Regression!

PSCI 8357 - STAT II

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

February 2, 2026

DiM vs. Regression


  • So far we have considered the difference in means as our naive estimator of causal quantities.
  • This week we will see that we can also use regression, agnostically, to estimate causal estimands.

    • This makes our life easier, especially if we would like to rely on the conditional ignorability assumption. (Why?)
  • BUT this only solves the estimation problem.

    • We still have to make assumptions to achieve causal identification!
  • Problem: If we want to learn about the relationship between \(X\) and \(Y\)

    • The ideal is to learn about \(f_{YX}(\cdot)\),
    • In practice we learn about \({\mathbb{E}}[Y {\:\vert\:}X]\).

CEF

Conditional Expectation Function (CEF)

CEF

The CEF, \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\), is the expected value of \(Y_i\) across values of \(X_i\):

  • For continuous \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \int_{\mathcal{Y}} y f(y {\:\vert\:}X_i) \, dy \]

  • For discrete \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \sum_{y \in \mathcal{Y}} y \, p(y {\:\vert\:}X_i) \]

  • Population-Level Function: Describes the relationship between \(Y_i\) and \(X_i\) in the population (finite or super-population).
  • Functional Flexibility: Can be non-linear (!).
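  • For a discrete \(X_i\), the CEF can be estimated directly by group means. A minimal base-R sketch with simulated data (the quadratic CEF and all names are illustrative):

```r
set.seed(1)
n <- 5000
X <- sample(1:5, n, replace = TRUE)   # discrete covariate
Y <- 2 + 0.5 * X^2 + rnorm(n)         # nonlinear CEF: E[Y | X] = 2 + 0.5 * X^2

cef_hat <- tapply(Y, X, mean)         # sample conditional mean at each X
cef_true <- 2 + 0.5 * (1:5)^2
round(rbind(estimate = cef_hat, truth = cef_true), 2)
```

Note that no linearity is imposed anywhere: the group means trace out the (nonlinear) CEF directly.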

Decomposition of Observed Outcomes

CEF Decomposition Property

\[ Y_i = \underbrace{{\mathbb{E}}[Y_i {\:\vert\:}X_i]}_{\text{explained by $X_i$}} + \underbrace{\varepsilon_i}_{\text{unexplained}}, \]

where \({\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] = 0\) and \(\varepsilon_i\) is uncorrelated with any function of \(X_i\).

  • Intuition: The CEF isolates the systematic component of \(Y_i\) explained by \(X_i\), while \(\varepsilon_i\) captures noise.
  • To see this property, recall

    \[ \begin{align*} \varepsilon_i &= Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] \quad \implies\\ {\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] &= {\mathbb{E}}[Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] {\:\vert\:}X_i] = 0 \end{align*} \]

  • Also \({\mathbb{E}}[h(X_i) \varepsilon_i] = 0\). (How can we use the Law of Iterated Expectations to prove this?)
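  • Both parts of the decomposition property can be verified numerically. A small simulation sketch (the quadratic CEF and the test function \(h(x) = x^3\) are arbitrary illustrative choices):

```r
set.seed(2)
n <- 10000
X <- sample(1:5, n, replace = TRUE)
Y <- 2 + 0.5 * X^2 + rnorm(n)

eps <- Y - (2 + 0.5 * X^2)   # residual around the true CEF
tapply(eps, X, mean)         # E[eps | X] is ~0 in every cell
cov(X^3, eps)                # ~0: uncorrelated with a function of X
```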

Best Minimal MSE Predictor

CEF Prediction Property

\[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = {\arg\!\min}_{g(X_i)} {\mathbb{E}}\left[ (Y_i - g(X_i))^2 \right], \] where \(g(X_i)\) is any function of \(X_i\).

  • Intuition: The CEF is the best predictor of \(Y_i\) in the least-squares sense.
  • To see this property decompose the squared expression:

\[ \begin{align*} (Y_i - g(X_i))^2 &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] + {\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2 \\ &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)^2 + 2\left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)\left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right) \\ &\quad + \left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2. \end{align*} \]

  • Taking expectations, the middle (cross) term vanishes by the decomposition property, and the last term is minimized by choosing \(g(X_i) = {\mathbb{E}}[Y_i {\:\vert\:}X_i]\).
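  • The prediction property can also be checked by simulation: the true CEF should beat any rival predictor in MSE. A sketch (the sine CEF and the rival predictors are illustrative):

```r
set.seed(3)
n <- 10000
X <- runif(n, -2, 2)
cef <- sin(X)                    # true (nonlinear) CEF
Y <- cef + rnorm(n)

mse <- function(pred) mean((Y - pred)^2)
mse(cef)                         # the CEF itself: ~ noise variance
mse(predict(lm(Y ~ X)))          # best linear predictor: slightly worse
mse(mean(Y))                     # constant predictor: worst of the three
```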

Discrete Case CEF


Density distributions show the spread of \(Y\) values at each discrete \(X\); black line connects the conditional means.

  • The CEF is the average line through the scatter of data points for each discrete \(X\).

Why Does CEF Matter?



  • The CEF properties we just established are important because:

    1. Decomposition: Any outcome can be split into a systematic part (explained by covariates) and noise.

    2. Optimality: The CEF is the best predictor of \(Y_i\) given \(X_i\) in the MSE sense.

  • Key insight: If we can estimate \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) or \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\) well, we can estimate differences in conditional means—which under the right assumptions are causal effects.
  • The question becomes: Can regression help us estimate the CEF?

Regression Justification

Regression?



  • The \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) quantity looks very familiar; we already used it in the forms \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) and \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\).

  • We want to see whether regression helps us estimate these quantities, especially when we want to estimate differences in means.

  • Note: There is nothing causal in \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) or \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\), so we still need identification.

    • We can for example rely on strong or conditional ignorability.

Some Regression Coefficient Properties

  • Before we move on, we need to recall some important facts about regression coefficients:

    1. The population regression coefficient vector is given by (this follows directly from \({\mathbb{E}}[X_i \varepsilon_i] = 0\)) \[ \beta = {\mathbb{E}}[X_i X_i^{\prime}]^{-1} {\mathbb{E}}[X_i Y_i] \]

    2. The regression coefficient in the single-covariate case is given by (population and sample analogs) \[ \beta = \frac{{\mathrm{cov}}(Y_i,X_i)}{{\mathbb{V}}(X_i)}, \quad \widehat{\beta} = \frac{\sum_{i = 1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i = 1}^{n} (X_i - \bar{X})^2} \]

    3. The regression coefficient in the multiple-covariate case is given by \[ \beta_{k} = \frac{{\mathrm{cov}}(\tilde{Y}_i,\tilde{X}_{ki})}{{\mathbb{V}}(\tilde{X}_{ki})}, \] where \(\tilde{X}_{ki}\) is the residual from regressing \(X_{ki}\) on all other covariates \(X_{-k,i}\)
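  • All three facts are easy to verify numerically. A base-R sketch (the DGP and variable names are illustrative):

```r
set.seed(4)
n <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

# 1. matrix formula (sample analog of E[X X']^{-1} E[X Y])
X <- cbind(1, x1, x2)
beta_mat <- solve(t(X) %*% X, t(X) %*% y)

# 2. single covariate: cov / var
b_simple <- cov(y, x1) / var(x1)          # equals coef(lm(y ~ x1))["x1"]

# 3. regression anatomy for x1 in the multiple regression
x1_tilde <- resid(lm(x1 ~ x2))            # residualize x1 on x2
y_tilde  <- resid(lm(y ~ x2))             # residualize y on x2
b_anatomy <- cov(y_tilde, x1_tilde) / var(x1_tilde)
```

All three agree with `lm()` up to machine precision.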

Justification 1: Linearity


Theorem: Linear CEF

If CEF \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) is linear in \(X_i\), then the population regression function \(X_i^{\prime} \beta\) returns exactly \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\).

  • To see this property we can:

    • Use the decomposition property of the CEF to see that \({\mathbb{E}}[ X_i (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]) ] = 0\)

    • Substitute \({\mathbb{E}}[Y_i {\:\vert\:}X_i] = X_i^{\prime} b\) and solve for \(b\)

  • How plausible is this linearity assumption in practice?
  • It always holds in the simple case where \(T_i\) is a binary treatment indicator, since a CEF on two support points is always linear: \({\mathbb{E}}[Y_i {\:\vert\:}T_i] = \beta_0 + \beta_1 T_i\).

Binary Case CEF


Justification 2: Linear Approximation

  • What if the CEF is not linear?
  • Regression can still be used to approximate the CEF:

Regression Prediction Property

The function \(X_i' \beta\) provides the minimum-MSE linear approximation to \({\mathbb{E}}[Y_i | X_i]\), that is:

\[ \beta = {\arg\!\min}_b {\mathbb{E}}\left[ ({\mathbb{E}}[Y_i | X_i] - X_i' b)^2 \right]. \]

  • Intuition: Even if the CEF is not linear, we can use regression to approximate it and draw substantive conclusions.
  • To see this, we can decompose the squared error function minimized by OLS:

\[ \begin{align*} (Y_i - X_i' b)^2 &= \left( (Y_i - {\mathbb{E}}[Y_i | X_i]) + ({\mathbb{E}}[Y_i | X_i] - X_i' b) \right)^2 \\ &= (Y_i - {\mathbb{E}}[Y_i | X_i])^2 + ({\mathbb{E}}[Y_i | X_i] - X_i' b)^2 \\ &\quad + 2 (Y_i - {\mathbb{E}}[Y_i | X_i]) ({\mathbb{E}}[Y_i | X_i] - X_i' b). \end{align*} \]

  • The first term doesn’t involve \(b\).
  • The last term has an expectation of zero due to the CEF-decomposition property.

Approximation of Discrete Case CEF


What Does This All Mean?




  • In the case of a CEF with respect to binary \(X_i\) (think \(T_i\)), OLS provides an estimate of \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) that is identical to the difference in means.
  • In the case of a CEF linear in \(X_i\), OLS estimates \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) exactly: the coefficient is the (constant) increase in the mean of \(Y_i\) per unit change in \(X_i\).
  • In the case of a CEF non-linear in \(X_i\), OLS provides the best linear approximation to \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\).

Regression and Causality

Back to Simple Binary Setup


  • Suppose \(\mathcal{T} = \{0, 1\}\)

  • Under SUTVA (no interference and consistency) POs are \(Y_{i} (1)\) and \(Y_{i} (0)\).

  • A unit-level treatment effect is \(\tau_i = Y_{i} (1) - Y_{i} (0)\)

    • \({\mathbb{E}}[\tau_i] = {\mathbb{E}}[Y_{i} (1) - Y_{i} (0)] = \tau_{ATE}\) is the average treatment effect (\(ATE\)).
  • We observe \(X_i\), \(T_i\), and \(Y_i = T_i Y_{i} (1) + (1 - T_i )Y_{i} (0)\).

  • In this simple case the OLS estimator solves the least squares problem:

    \[ (\widehat{\tau}, \widehat{\alpha}) = {\arg\!\min}_{\tau, \alpha} \sum_{i=1}^n \left(Y_i - \alpha - \tau T_i\right)^2 \]

  • The coefficient \(\widehat{\tau}\) is algebraically equivalent to the difference in means (\(\widehat{\tau}_{DiM}\)):

    \[ \widehat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \widehat{\tau}_{DiM} \]
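  • This algebraic equivalence is easy to confirm. A quick sketch with simulated data:

```r
set.seed(5)
n <- 200
T_i <- rbinom(n, 1, 0.5)            # binary treatment (T_i, since T means TRUE in R)
Y <- 1 + 0.5 * T_i + rnorm(n)

dim_est <- mean(Y[T_i == 1]) - mean(Y[T_i == 0])
ols_est <- unname(coef(lm(Y ~ T_i))["T_i"])
c(DiM = dim_est, OLS = ols_est)     # identical up to machine precision
```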

Regression Justification

  • Key assumptions: linearity and mean independence of errors. (Why do we care about the latter?)
  • Using switching equation we can show that:

\[ \begin{align*} Y_i &= T_i Y_i(1) + (1 - T_i) Y_i(0) \\ &= Y_i(0) + T_i ( Y_i(1) - Y_i(0) ) \quad\text{($\because$ distribute)}\\ &= Y_i(0) + \tau_i T_i \quad \text{($\because$ unit treatment definition)}\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + ( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (\tau_i - \tau) \quad (\because \pm {\mathbb{E}}[Y_i(0)] + \tau T_i)\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \quad\text{($\because$ distribute)}\\ &= \alpha + \tau T_i + \eta_i \end{align*} \]

  • The linear functional form is fully justified by the SUTVA assumption alone:

    • Intercept: \(\alpha = {\mathbb{E}}[Y_i(0)]\) (average control outcome).
    • Slope: \(\tau = {\mathbb{E}}[Y_i(1) - Y_i(0)]\) (average treatment effect).
    • Error: deviation of control PO + treatment effect heterogeneity. What is the second interpretation?

Mean independent errors

  • The error is given by

\[ \eta_i = (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \]

  • In the regression context we would like \({\mathbb{E}}[\eta_i {\:\vert\:}T_i] = 0\). Does it hold here?

\[ \begin{align*} {\mathbb{E}}[\eta_i {\:\vert\:}T_i] &= {\mathbb{E}}[(1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) {\:\vert\:}T_i] \\ &= (1 - T_i) ({\mathbb{E}}[Y_i(0) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(0)]) + T_i ({\mathbb{E}}[Y_i(1) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(1)]) \end{align*} \]

  • Does this look familiar? This is selection with respect to \(Y_i(0)\) and \(Y_i(1)\).
  • When would this be equal to zero? E.g., under random assignment (strong ignorability).
  • Randomization + consistency justify the linear model.

    • Does not imply homoskedasticity or normal errors, though!

    • Practical implication: Use heteroskedasticity-robust (HC2) standard errors for inference, e.g. via lm_robust() from the estimatr package.
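  • The HC2 estimator can also be computed by hand from a plain lm fit; for a binary treatment it reproduces the Neyman variance \(s_1^2/n_1 + s_0^2/n_0\). A base-R sketch (simulated heteroskedastic data, illustrative only; in practice use lm_robust()):

```r
set.seed(6)
n <- 500
T_i <- rbinom(n, 1, 0.4)
Y <- 1 + 0.5 * T_i + rnorm(n, sd = 1 + T_i)   # error variance differs by arm

fit <- lm(Y ~ T_i)
X <- model.matrix(fit)
e <- resid(fit)
h <- hatvalues(fit)                           # leverage values

XtX_inv <- solve(t(X) %*% X)
meat <- t(X) %*% (X * (e^2 / (1 - h)))        # HC2: inflate e_i^2 by 1/(1 - h_i)
V_hc2 <- XtX_inv %*% meat %*% XtX_inv
sqrt(diag(V_hc2))                             # HC2 standard errors
```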

Regression with Covariates

Selection on Observables


  • Under strong ignorability we can use regression to estimate causal effects of interest.

  • What if instead we assume that selection depends on a set of observed covariates \(X_{i}\), i.e. there is selection on observables?

  • This implies the conditional ignorability assumption, i.e.

\[ \{ Y_i(0), Y_i(1) \} {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i. \]

  • This in turn implies \(\eta_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\) (Why?)
  • Note: If conditional ignorability is true and we are able to condition on \(X_i\), the following is also true

\[ \eta_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \implies {\mathbb{E}}[\eta_i {\:\vert\:}T_i=1, X_i] = {\mathbb{E}}[\eta_i {\:\vert\:}T_i=0, X_i] = {\mathbb{E}}[\eta_i {\:\vert\:}X_i]. \]

Linear POs

  • Now assume constant linear treatment effects, i.e. POs are given by

\[ f_i(t) = \alpha + \tau t + \eta_i \]

  • Observed outcomes are given by:

    \[ Y_i = f_i(T_i) = \alpha + \tau T_i + \eta_i, \]

    where \(\eta_i\) captures all variable determinants of \(f_i(T_i)\) other than \(T_i\).

  • We also allow \(\eta_i\) to depend on covariates we identified:

    \[ \eta_i = X_i^{\prime} \gamma + \nu_i, \]

    where \(\gamma\) is the population regression solution.

  • Orthogonality of residuals to regressors in the population implies \({\mathbb{E}}[X_i \nu_i] = 0\).

Linear POs


  • We further assume linearity in \(X_i\) as well (!)

\[ {\mathbb{E}}[\eta_i | X_i] = X_i^\prime \gamma \]

  • Since, when \(X_i\) is fixed, only \(\nu_i\) varies within \(\eta_i\), under this model we have

    \[ f_i(t) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \implies \nu_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i. \]

  • With conditional ignorability and linearity, we obtain:

\[ Y_i = \alpha + \tau T_i + X_i^{\prime} \gamma + \nu_i, \]

where \(\nu_i\) is uncorrelated with \(X_i\) and also with \(T_i\) conditional on \(X_i\).

From Linear POs to Regression

  • By conditional ignorability,

\[ {\mathbb{E}}[f_i(t) {\:\vert\:}T_i = t, X_i] = {\mathbb{E}}[f_i(t) {\:\vert\:}X_i] = \alpha + \tau t + X_i^{\prime} \gamma. \]

  • Then,

\[ \begin{align*} {\mathbb{E}}[f_i(t) &- f_i(t - v) {\:\vert\:}X_i] \\ &= (\alpha + \tau t + X_i^{\prime} \gamma) - (\alpha + \tau (t - v) + X_i^{\prime} \gamma) \\ &= \tau v \end{align*} \]

  • (\(X_i\) disappeared because confounding enters linearly and cancels out.)
  • Result:

    • So, \(\tau\) is the causal effect of a unit change in \(t\).
    • \(\nu_i\) is uncorrelated with \(X_i\) and \(T_i\), so OLS is consistent for \(\tau\).

Omitted Variable Bias

Omitted Variable Bias (OVB)

  • Now suppose we erroneously omit \(X_i\), and just regress \(Y_i\) on \(T_i\) via OLS.
  • To see the omitted variable bias, we look at what the coefficient on \(T_i\) estimates, \(\frac{{\mathrm{cov}}(Y_i, T_i)}{{\mathbb{V}}(T_i)}\), assuming the true model includes \(X_i\):

\[ \begin{align*} {\mathrm{cov}}(Y_i, T_i) &= {\mathrm{cov}}(\alpha + \tau T_i + X_i' \gamma + \nu_i,\, T_i) \\ &= \tau {\mathrm{cov}}(T_i, T_i) + {\mathrm{cov}}(X_{1i} \gamma_1 + \ldots + X_{Ki} \gamma_K, T_i) \quad \text{($\because$ ${\mathrm{cov}}(\nu_i, T_i) = 0$)} \\ &= \tau {\mathbb{V}}(T_i) + \gamma_1 {\mathrm{cov}}(X_{1i}, T_i) + \ldots + \gamma_K {\mathrm{cov}}(X_{Ki}, T_i) \end{align*} \]

\[ \implies \frac{{\mathrm{cov}}(Y_i, T_i)}{{\mathbb{V}}(T_i)} = \tau + \underbrace{\gamma^{\prime} \delta}_{\text{OVB}} \]

where \(\delta\) are coefficients from regressions of \(X_1, \ldots, X_K\) on \(T_i\).

  • By the Frisch–Waugh–Lovell theorem, if we include some of \(X_i\) we will get \(\frac{{\mathrm{cov}}(\tilde{Y}_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)} = \tau + \tilde{\gamma}^{\prime}\tilde{\delta}\), where \(\tilde{\cdot}\) means residualized with respect to included terms from \(X_i\).

Omitted Variable Bias


  • OVB = \(\gamma^\prime \delta\), where

    • \(\gamma\) is the vector of effects of confounders on the outcome.
    • \(\delta\) is the vector of associations between confounders and treatment — i.e., the degree of confounder-induced imbalance in treatment assignment.
  • The same holds when we include only some of the controls:

    \[ \text{OVB} = \tilde{\gamma}' \tilde{\delta}. \]

    Everything is just defined in terms of variables that have been residualized with respect to the included controls.

  • OVB = confounder impact \(\times\) imbalance (Cinelli and Hazlett 2020).

Omitted Variable Bias

  • Let’s practice applying the OVB formula:

    OVB = \((X_{ki}, Y_i)\) relationships \(\times\) \((X_{ki}, T_i)\) relationships

[Figure: DAG with edges \(T \to Y\), \(X \to Y\), \(X \to T\)]

  1. Effect of democratic institutions on growth, estimated via regression of growth on democratic institutions.

  2. Effect of exposure to negative advertisements on turnout, estimated via regression of turnout on the number of ads seen.

  • Question: What is a possible omitted variable? How will this bias the estimate?

OVB: Simulate DAG Relationship

set.seed(20250127) # set seed

n <- 1000 # sample size
tau <- 0.5 # ATE
gamma <- 0.3 # effect of confounder on outcome
delta <- 0.3 # effect of confounder on treatment

# confounder
confounder <- rnorm(n, mean = 50, sd = 10)

# democratic institutions (correlated with confounder)
democracy_score <- delta * confounder + rnorm(n, mean = 0, sd = 5)

# economic growth (influenced by both the confounder and democratic institutions)
growth <- tau *
  democracy_score +
  gamma * confounder +
  rnorm(n, mean = 0, sd = 5)

# true regression including the confounder
model_unbiased <- lm(growth ~ democracy_score + confounder)
cat("Unbiased model error:", unname(model_unbiased$coefficients[2]) - tau, "\n")

Unbiased model error: -0.01922573

# regression ignoring the confounder
model_biased <- lm(growth ~ democracy_score)
cat("Biased model error:", unname(model_biased$coefficients[2]) - tau, "\n")

Biased model error: 0.3081032
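  • The bias above can be checked against the OVB formula: in sample, the short-regression coefficient exceeds the long-regression coefficient by exactly \(\widehat{\gamma}\,\widehat{\delta}\). A self-contained sketch re-using the simulation's parameters:

```r
set.seed(20250127)
n <- 1000
confounder <- rnorm(n, mean = 50, sd = 10)
democracy_score <- 0.3 * confounder + rnorm(n, mean = 0, sd = 5)
growth <- 0.5 * democracy_score + 0.3 * confounder + rnorm(n, mean = 0, sd = 5)

long  <- lm(growth ~ democracy_score + confounder)  # includes the confounder
short <- lm(growth ~ democracy_score)               # omits it

# delta_hat: slope from regressing the omitted variable on the treatment
delta_hat <- unname(coef(lm(confounder ~ democracy_score))["democracy_score"])
gamma_hat <- unname(coef(long)["confounder"])

c(gap = unname(coef(short)[2] - coef(long)[2]),
  ovb = gamma_hat * delta_hat)                      # identical
```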

OVB: High \(\gamma\), High \(\delta\)

OVB: High \(\gamma\), Low \(\delta\)

OVB: Low \(\gamma\), High \(\delta\)

OVB: Low \(\gamma\), Low \(\delta\)

Be Careful!



  • "Omitted variables" is a misleading term because it could suggest that you want to include any variable that is correlated with treatment and outcome.

  • But remember bad controls exist, e.g.:

    • Common descendants of treatment and outcome (colliders)
    • Blocking a causal path by controlling for a mediator
    • \(M\)-bias variables

Two Ways to Adjust for Covariates


  • The discussion of OVB suggests that we can use regression to adjust for variables (\(X_i\)) to estimate the treatment effect (\(\tau\)) in two ways.

    1. Long regression: Include covariates \(X_i\) directly in the regression model.

    2. Residualized regression:

      1. Purge variation in \(Y_i\) due to \(X_i\) \(\rightarrow\) Regress \(Y_i\) on \(X_i\) and calculate residual outcomes: \(\tilde{Y}_i = Y_i - \widehat{Y}_i\).
      2. Purge variation in \(T_i\) due to \(X_i\) \(\rightarrow\) Regress \(T_i\) on \(X_i\) and calculate residual treatments: \(\tilde{T}_i = T_i - \widehat{T}_i\).
      3. Regress \(\tilde{Y}_i\) on \(\tilde{T}_i\).
  • Result: The coefficient on \(T_{i}\) in the long regression and the coefficient on \(\tilde{T}_i\) in the residualized regression are identical.

Back-Door Criterion

Identification Analysis with Causal Graphs

  • An alternative, perhaps more intuitive, way to think about confounding is in terms of DAGs.
  • Suppose we want to estimate the \(ATE\) of \(T\) on \(Y\); which covariates do we need to measure?
  • Pearl develops criteria that can be read directly off the graph alone.
  • Before studying the criteria, we need to define some new concepts.
  • Nodes: \(T\), \(Y\), \(Z_1\), \(Z_2\), and \(Z_3\).

  • Paths: \(T \to Y\), \(T \leftarrow Z_3 \to Y\), \(T \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to Y\), etc.

  • \(Z_1\) is a parent of \(T\) and \(Z_3\).

  • \(T\) and \(Z_3\) are children of \(Z_1\).

  • \(Z_1\) is an ancestor of \(Y\).

  • \(Y\) is a descendant of \(Z_1\).

[Figure: DAG with edges \(T \to Y\), \(Z_1 \to T\), \(Z_1 \to Z_3\), \(Z_2 \to Y\), \(Z_2 \to Z_3\), \(Z_3 \to T\), \(Z_3 \to Y\)]

Blocked Paths and \(d\)-separation

Definition: Blocked Paths

A set of nodes \(X\) blocks a path \(p\) if either:

  1. \(p\) contains at least one arrow-emitting node in \(X\), OR
  2. \(p\) contains at least one collision node that is outside \(X\) and has no descendant in \(X\).
  • \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \to Y\) is blocked by \(\{W_1\}\), \(\{Z_1\}\), \(\{Z_1, Z_3\}\), etc.
  • \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to W_2 \to Y\) is blocked by the empty set \(\emptyset\), since the collider \(Z_3\) already blocks it.

[Figure: DAG with edges \(T \to W_3\), \(W_3 \to Y\), \(Z_1 \to Z_3\), \(Z_1 \to W_1\), \(Z_2 \to Z_3\), \(Z_2 \to W_2\), \(Z_3 \to T\), \(Z_3 \to Y\), \(W_1 \to T\), \(W_2 \to Y\)]

Definition: \(d\)-separation

If \(X\) blocks all paths from \(T\) to \(Y\), then \(X\) \(d\)-separates \(T\) and \(Y\).
If \(X\) \(d\)-separates \(T\) and \(Y\), then \(Y {\mbox{$\perp\!\!\!\perp$}}T {\:\vert\:}X\).

  • \(W_1\) and \(Z_3\) are \(d\)-separated by set \(X = \{Z_1\}\).

The Back-Door Criterion for Causal Identification

Theorem: The Back-Door Criterion

A set \(X\) is sufficient for adjustment to identify the causal effect of \(T\) on \(Y\) if:

  1. The elements of \(X\) block all back-door paths from \(T\) to \(Y\) (confounders, but not colliders), and
  2. No element of \(X\) is a descendant of \(T\) on a direct path to \(Y\) (no post-treatment conditioning).
  • Important: The back-door criterion tells you which covariates to condition on to identify a causal effect, given a hypothesized DAG.

[Figure: DAG with edges \(T \to W_3\), \(W_3 \to Y\), \(Z_1 \to Z_3\), \(Z_1 \to W_1\), \(Z_2 \to Z_3\), \(Z_2 \to W_2\), \(Z_3 \to T\), \(Z_3 \to Y\), \(W_1 \to T\), \(W_2 \to Y\)]


  • Example: Which variables should we control for to identify the effect of \(T\) on \(Y\)?
    • \(X = \{W_1, W_2\}\)? No.
    • \(X = \{Z_1, Z_3\}\)? Yes!
    • \(X = \{Z_3\}\)? No, because it unblocks \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to W_2 \to Y\).

The Good, The Bad, The Ugly… Controls

Choosing Controls: A Taxonomy



  • We follow Cinelli, Forney, and Pearl (2024), which provides a systematic framework for thinking about control variables.

  • Key insight: Not all variables that are correlated with treatment and outcome should be controlled for!

  • We will classify controls as:

    1. Good controls: Block back-door paths without introducing bias
    2. Neutral controls: Neither help nor hurt identification (but may affect precision)
    3. Bad controls: Introduce bias through collider conditioning or post-treatment adjustment

Good Controls 1


  • A confounder is a common cause of the main explanatory variable, \(X\), and the outcome of interest, \(Y\).
  • In model (a) \(Z\) is a common cause of \(X\) and \(Y\). Controlling for \(Z\) blocks the back-door path.
  • In models (b) and (c) \(Z\) is not a common cause, but controlling for \(Z\) blocks the back-door path that runs through the unobserved confounder \(U\).

[Figure: DAGs. (a) \(Z \to X\), \(Z \to Y\), \(X \to Y\); (b) \(Z \to X\), \(U \to Z\), \(U \to Y\), \(X \to Y\); (c) \(Z \to Y\), \(U \to Z\), \(U \to X\), \(X \to Y\)]

Good Controls 2



  • Intuition: Common causes of \(X\) and any mediator \(M\) (between \(X\) and \(Y\)) also confound the effect of \(X\) on \(Y\).
  • Models (a)-(c) are analogous to the models without a mediator: controlling for \(Z\) blocks the back-door path from \(X\) to \(Y\) (through \(M\)) and produces an unbiased estimate of the \(ATE\).

[Figure: DAGs. (a) \(Z \to X\), \(Z \to M\), \(X \to M\), \(M \to Y\); (b) \(Z \to X\), \(U \to Z\), \(U \to M\), \(X \to M\), \(M \to Y\); (c) \(Z \to M\), \(U \to Z\), \(U \to X\), \(X \to M\), \(M \to Y\)]

Neutral (?) Controls



  • Intuition: Ancestors of only \(X\), only \(Y\), or only \(M\) (the mediator) do not introduce bias. Controlling for them removes variation in the respective variable that is unrelated to variation in the other variables.
  • In model (a) reduction in variation is good! \(\rightarrow\) higher precision

  • In model (b) reduction in variation is bad! \(\rightarrow\) lower precision

  • In model (c) reduction in variation is good again! \(\rightarrow\) higher precision

[Figure: DAGs. (a) \(Z \to Y\), \(X \to Y\); (b) \(Z \to X\), \(X \to Y\); (c) \(Z \to M\), \(X \to M\), \(M \to Y\)]

Bad Controls: Selection Bias



  • Intuition: We do not want to control for colliders or their descendants. This induces selection bias.
  • In models (a) and (b) controlling for \(Z\) unblocks back-door paths and induces relationship between \(X\) and \(Y\).

  • In models (c) and (d) controlling for \(Z\) will unblock the back-door path \(X \leftarrow U_1 \to Z \leftarrow U_2 \to Y\).

[Figure: DAGs. (a) \(X \to Y\), \(X \to Z\), \(Y \to Z\); (b) \(X \to Y\), \(X \to Z\), \(U \to Z\), \(U \to Y\); (c) \(X \to Y\), \(U_1 \to X\), \(U_1 \to Z\), \(U_2 \to Y\), \(U_2 \to Z\); (d) \(X \to Y\), \(Z \to Y\), \(U_1 \to X\), \(U_1 \to Z\), \(U_2 \to Y\), \(U_2 \to Z\)]
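  • Collider conditioning is easy to see in a simulation. A stylized sketch in the spirit of model (a), with the true effect of \(X\) on \(Y\) set to zero so that any estimated association is pure selection bias:

```r
set.seed(7)
n <- 10000
X <- rnorm(n)
Y <- rnorm(n)              # no causal effect of X on Y
Z <- X + Y + rnorm(n)      # collider: caused by both X and Y

coef(lm(Y ~ X))["X"]       # ~0: X and Y are independent
coef(lm(Y ~ X + Z))["X"]   # ~ -0.5: conditioning on the collider induces bias
```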

Bad Controls: Selection Bias

Bad Controls: Post-Treatment Bias



  • Intuition: We do not want to block the channels through which the effect operates (unless we are interested in \(CATE\)). This induces post-treatment bias.
  • In models (a) and (b) controlling for \(Z\) blocks the causal path.

  • In model (c) controlling for \(Z\) blocks part of the causal path.

  • In model (d) controlling for \(Z\) will not block the causal path or induce any bias.

[Figure: DAGs. (a) \(X \to Z\), \(Z \to Y\); (b) \(X \to M\), \(M \to Y\), \(M \to Z\); (c) \(X \to Y\), \(X \to Z\), \(Z \to Y\); (d) \(X \to Y\), \(X \to Z\)]

Bad Controls: Post-Treatment Bias


  • To see the intuition behind post-treatment bias, consider the following example.

  • Suppose \(X \in \{0, 1\}\) is randomly assigned, and then

    \[ \begin{align*} Z &= X + \varepsilon_Z, \\ Y &= \beta X + \gamma Z + \varepsilon_Y, \end{align*} \]

    where \(\varepsilon_Z\) and \(\varepsilon_Y\) are independent standard normal draws.

  • Substituting in \(Y\):

    \[ Y = (\beta + \gamma)X + \gamma \varepsilon_Z + \varepsilon_Y \]

  • The total effect of \(X\) on \(Y\) is \(\beta + \gamma\).

  • Controlling for \(Z\), we would instead estimate an effect of \(\beta\).

  • The bias, \(-\gamma\), is the portion of the effect that has been “stolen away” by conditioning on \(Z\).
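  • The algebra above can be replicated in a short simulation (with illustrative values \(\beta = 1\), \(\gamma = 2\)):

```r
set.seed(8)
n <- 10000
beta <- 1
gamma <- 2
X <- rbinom(n, 1, 0.5)               # randomized treatment
Z <- X + rnorm(n)                    # post-treatment mediator
Y <- beta * X + gamma * Z + rnorm(n)

coef(lm(Y ~ X))["X"]       # ~ beta + gamma = 3: the total effect
coef(lm(Y ~ X + Z))["X"]   # ~ beta = 1: gamma is "stolen" by conditioning on Z
```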

Controls Conclusion



  • Be mindful of what controls you include in your analysis (even if it is an experiment).

  • Draw a DAG with the controls you plan to include and check whether

    • You need them to block any back-door paths.
    • They might be colliders or introduce post-treatment bias.
  • Do not use a "kitchen sink" approach!
  • Also be mindful of the sizes of the effects of potential confounders: if their effects on both the treatment and the outcome can be shown to be limited, the OVB is small!

Regression with Heterogeneous Treatments

Effect Heterogeneity


  • Thus far we have simplified things by assuming constant effects (\(\tau_i = \tau\) for all \(i\)) and linearity (\({\mathbb{E}}[\eta_i {\:\vert\:}X_i] = X'_i \gamma\)).
  • These are strong assumptions!

  • What if they are false? Let’s see.

  • Suppose

    • \(\mathcal{T} = \{0,1\}\),
    • Potential outcomes \((Y_{i} (0), Y_{i} (1))\),
    • Heterogeneous treatment effects, \(\tau_i = Y_{i} (1) - Y_{i} (0)\),
    • \({\mathbb{E}}[\eta_i {\:\vert\:}X_i] = f(X_i)\)
  • Let the conditional independence assumption (CIA) hold: \((Y_{i} (0), Y_{i} (1)) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\).

Effect Heterogeneity

  • In this case the effect of interest is still

    \[ \begin{align*} \tau_{ATE} &= {\mathbb{E}}_{X} [{\mathbb{E}}[Y_i (1) {\:\vert\:}X_i] - {\mathbb{E}}[Y_i (0) {\:\vert\:}X_i]] \\ &= {\mathbb{E}}_{X} [{\mathbb{E}}[Y_i (1) {\:\vert\:}T_i = 1, X_i] - {\mathbb{E}}[Y_i (0) {\:\vert\:}T_i = 0, X_i]] \\ &= \sum_{X} \tau_x {\textrm{Pr}}(X_i = x), \end{align*} \]

    where \(\tau_x \equiv {\mathbb{E}}[Y_i (1) {\:\vert\:}X_i = x] - {\mathbb{E}}[Y_i (0) {\:\vert\:}X_i = x]\)

  • How would we estimate this using regression?
  • We can use saturated (or one-way fixed effects) OLS regression model

    \[ Y_i = \alpha_0 + \tau T_i + \mathbb{1} [X_i = x_2] \alpha_{x_2} + \dots + \mathbb{1} [ X_i = x_L ] \alpha_{x_L} + \varepsilon_i, \]

    where \(\mathbb{1} [\cdot]\) denotes the indicator of event \(\cdot\); \(x_2, \dots, x_L\) exhaust all possible \(X_i\) values, omitting one (why?) from the specification.

  • This is the best we can do with OLS regression with controls. (Why?)

Regression Anatomy

  • Recall regression anatomy: \(\widehat{\tau} = \frac{{\mathrm{cov}}(\tilde{Y}_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\), where \(\tilde{T}_i\) is the residual from the regression of \(T_i\) on the other regressors

  • Let’s see if it actually works

# simulate data
n <- 1000
X <- rnorm(n)
D <- 0.5 * X + rnorm(n) # call the treatment D, not T (T is shorthand for TRUE in R)
Y <- 2 * D + 1 * X + rnorm(n)

# standard regression
standard <- coef(lm(Y ~ D + X))["D"]

# make Y tilde and D tilde
tilde_Y <- lm(Y ~ X)$residuals
tilde_D <- lm(D ~ X)$residuals

# regression anatomy
anatomy <- coef(lm(tilde_Y ~ tilde_D))["tilde_D"]

# simplified regression anatomy
anatomy_simp <- coef(lm(Y ~ tilde_D))["tilde_D"]

data.frame(
  Method = c("Standard", "Regression Anatomy", 
  "Regression Anatomy (Simplified)"),
  Coefficient = c(standard, anatomy, anatomy_simp)
) |>
  knitr::kable(digits = 3)
| Method                          | Coefficient |
|:--------------------------------|------------:|
| Standard                        |       1.978 |
| Regression Anatomy              |       1.978 |
| Regression Anatomy (Simplified) |       1.978 |

Regression Anatomy

  • Recall that \(\widehat{\tau} = \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\), where \(\tilde{T}_i\) is the residual from the regression of \(T_i\) on the other regressors

    \[ \begin{align*} \widehat{\tau} &= \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\\ &= \frac{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i](T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])] - \textcolor{#d65d0e}{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]]\,{\mathbb{E}}[T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]]}}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]}\\ &= \frac{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i](T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]}. \quad \text{($\because$ ${\mathbb{E}}[T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]] = 0$)} \end{align*} \]

  • Now let’s look at the first term in the numerator

    \[ \begin{align*} {\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] &= T_i{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i, X_i] + \textcolor{#458588}{{\mathbb{E}}[Y_{i} (0) {\:\vert\:}T_i, X_i]} \quad \text{($\because$ switching equation)}\\ &= T_i{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i, X_i] + {\mathbb{E}}[\textcolor{#458588}{Y_{i} (0)} {\:\vert\:}T_i = 0, X_i] \quad \text{($\because$ CIA)}\\ &= T_i \textcolor{#458588}{{\mathbb{E}}[Y_{i} (1) - Y_{i} (0) {\:\vert\:}T_i, X_i]} + {\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i] \quad \text{($\because$ switching equation)}\\ &= T_i \tau_X + {\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i] \quad \text{($\because$ definition of $\tau_X$)} \end{align*} \]

Regression Anatomy

\[ \begin{align*} \widehat{\tau} &= \frac{{\mathbb{E}}[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i](T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \\ &= \frac{{\mathbb{E}}[\textcolor{#458588}{\left( T_i \tau_X + {\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i] \right)} (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \quad \text{($\because$ plug in result)}\\ &= \frac{{\mathbb{E}}[ \textcolor{#458588}{T_i \tau_X} (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]) + \textcolor{#458588}{{\mathbb{E}}[Y_{i} {\:\vert\:}T_i = 0, X_i]} (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]) ]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \quad \text{($\because$ distribute)} \\ &= \frac{{\mathbb{E}}[ \textcolor{#d65d0e}{T_i} \tau_X \textcolor{#d65d0e}{(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])} ]}{{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2]} \quad \text{($\because$ independence of residuals)} \\ &= \frac{{\mathbb{E}}_X [ \tau_X \textcolor{#d65d0e}{{\mathbb{E}}[ T_i^2 - T_i {\mathbb{E}}[T_i {\:\vert\:}X_i] {\:\vert\:}X_i]} ]}{{\mathbb{E}}_X [ \textcolor{#d65d0e}{{\mathbb{E}}[ (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2 {\:\vert\:}X_i ]} ]} \quad \text{($\because$ iterated ${\mathbb{E}}$ )} \\ &= \frac{{\mathbb{E}}_X [ \tau_X \textcolor{#d65d0e}{{\mathbb{V}}(T_i {\:\vert\:}X_i)} ]}{{\mathbb{E}}_X [ \textcolor{#d65d0e}{{\mathbb{V}}(T_i {\:\vert\:}X_i)} ] } \quad \text{($\because$ definition of ${\mathbb{V}}$ )}\\ &= \frac{\sum_X \tau_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)}{\sum_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)}. \quad \text{($\because$ binary $T_i$)} \end{align*} \]

What Was This?

Effect Heterogeneity


  • Compare

    \[ \tau_{ATE} = \sum_{X} \tau_x {\textrm{Pr}}(X_i = x), \]

    versus

    \[ \widehat{\tau} = \frac{\sum_X \tau_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)} {\sum_X \textcolor{#d65d0e}{{\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x) (1 - {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x))} {\textrm{Pr}}(X_i = x)} \]

  • \(\widehat{\tau}\) aggregates via conditional variance weighting with respect to \(T_i\) instead of just probability.

  • If \(\tau_i\) were constant with respect to \(X_i\), this precision weighting would be desirable from an efficiency standpoint.

  • If \(T_i {\mbox{$\perp\!\!\!\perp$}}X_i\), the conditional variance is constant across strata, so \(\widehat{\tau}\) reduces to weighting by \({\textrm{Pr}}(X_i = x)\) and recovers \(\tau_{ATE}\).
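The contrast between the two weighting schemes can be seen in a toy calculation (the strata, effects, and propensities below are hypothetical):

```r
# Toy comparison (hypothetical numbers): the ATE weights stratum effects
# tau_x by Pr(X = x); the regression estimand instead weights by the
# conditional variance Pr(T = 1 | X = x) * (1 - Pr(T = 1 | X = x)).
tau_x <- c(0.2, 0.5, 1.0) # stratum-specific effects
pr_x  <- c(0.5, 0.3, 0.2) # Pr(X = x)
p_x   <- c(0.1, 0.5, 0.9) # Pr(T = 1 | X = x)

ate <- sum(tau_x * pr_x) # 0.45

w <- p_x * (1 - p_x) # conditional variance of a binary treatment
tau_reg <- sum(tau_x * w * pr_x) / sum(w * pr_x)

tau_reg # about 0.467: strata with p_x near 0.5 are over-weighted
```

The middle stratum, where treatment is closest to a coin flip, gets disproportionate weight, pulling \(\widehat{\tau}\) away from \(\tau_{ATE}\).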

Truth about Regression


  • Logic carries through to continuous treatments (Angrist and Pischke 2009, 77–80; Aronow and Samii 2016).

  • Aronow and Samii (2016) show that for arbitrary \(T_i\) and \(X_i\),

    \[ \widehat{\tau} \xrightarrow{p} \frac{{\mathbb{E}}[w_i \tau_i]}{{\mathbb{E}}[w_i]}, \quad \text{where } w_i = (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2, \]

    in which case

    \[ {\mathbb{E}}[w_i {\:\vert\:}X_i] = {\mathbb{V}}[T_i {\:\vert\:}X_i]. \]

  • The effective sample is weighted by \(\widehat{w}_i = (T_i - \widehat{{\mathbb{E}}}[T_i {\:\vert\:}X_i])^2\), the squared residual from a regression of \(T_i\) on the covariates.

  • Even with a representative sample, regression estimates may not aggregate effects in a representative manner. Regression estimates are local to an effective sample.
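One way to see what the effective sample looks like in practice is to estimate the weights from the first-stage residuals (a minimal sketch on simulated data; the propensity function is an illustrative assumption):

```r
# Sketch: estimate effective-sample weights w_hat_i = (T_i - E_hat[T_i | X_i])^2
# and compare the nominal sample to the effective sample.
# Simulated data; the propensity 0.5 * X is illustrative.
set.seed(1)
n <- 5000
X <- runif(n)
Tr <- rbinom(n, size = 1, prob = 0.5 * X) # treatment probability rises with X

w_hat <- residuals(lm(Tr ~ X))^2 # squared first-stage residuals

mean(X) # nominal-sample mean of X, about 0.5
weighted.mean(X, w_hat) # effective-sample mean of X, noticeably larger
```

Units with higher \(X\) have higher conditional treatment variance here, so the effective sample over-represents them.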

Let’s Try a Simulation


set.seed(20250202) # set seed

n <- 1000 # sample size
tau_base <- 0.5
gamma <- 0.1 # effect of X on outcome

# some discrete covariate (the confounder)
X <- sample(x = 1:100, size = n, replace = TRUE)

# average treatment effect (averaging the heterogeneous effects over X)
tau_avg <- mean(tau_base + 0.01 * 1:100)

# democratic institutions: first assigned independently of X,
# then with treatment probability increasing in X
democracy_high <- rbinom(n, size = 1, prob = .5)
democracy_high_2 <-
  rbinom(n, size = 1, prob = pmin(1, .5 + 0.01 * X)) # capped so probs are valid

# economic growth (influenced by both democratic institutions and X)
growth <-
  (tau_base + 0.01 * X) *
  democracy_high +
  gamma * X +
  rnorm(n, mean = 0, sd = 5)

growth_2 <-
  (tau_base + 0.01 * X) *
  democracy_high_2 +
  gamma * X +
  rnorm(n, mean = 0, sd = 5)

# regression adjusting for X; assignment independent of X
bias1 <- lm(growth ~ democracy_high + factor(X))$coefficients[2] - tau_avg

# regression adjusting for X; assignment probability varies with X
bias2 <- lm(growth_2 ~ democracy_high_2 + factor(X))$coefficients[2] - tau_avg
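To average out sampling noise, one can repeat the simulation many times (a slimmed-down re-implementation of the setup above; the \(X\)-dependent treatment probability is capped at 1 so it remains a valid probability):

```r
# Repeat the simulation to average out noise: bias1 comes from a design
# where assignment is independent of X; bias2 from one where the
# treatment probability increases with X (capped at 1).
sim_once <- function(n = 1000, tau_base = 0.5, gamma = 0.1) {
  X <- sample(1:100, n, replace = TRUE)
  tau_avg <- mean(tau_base + 0.01 * 1:100) # ATE under uniform X
  d1 <- rbinom(n, 1, 0.5) # assignment independent of X
  d2 <- rbinom(n, 1, pmin(1, 0.5 + 0.01 * X)) # assignment depends on X
  y1 <- (tau_base + 0.01 * X) * d1 + gamma * X + rnorm(n, 0, 5)
  y2 <- (tau_base + 0.01 * X) * d2 + gamma * X + rnorm(n, 0, 5)
  c(bias1 = unname(coef(lm(y1 ~ d1 + factor(X)))[2]) - tau_avg,
    bias2 = unname(coef(lm(y2 ~ d2 + factor(X)))[2]) - tau_avg)
}

set.seed(20250202)
res <- rowMeans(replicate(100, sim_once()))
res # bias1 close to 0; bias2 clearly negative (variance weighting)
```

With heterogeneous assignment, strata with \({\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i)\) near 0 or 1 contribute little variance, so the regression down-weights the high-\(X\) (large-\(\tau_X\)) strata and underestimates the ATE.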

Heterogeneous \(\tau\) and Assignment

Heterogeneous \(\tau\) Only

Heterogeneous Assignment Only

Lessons

  • Regression is a useful tool for estimating causal effects under conditional ignorability (CIA):
    • Binary treatments: regression can provide consistent estimates of the \(ATE\).
    • Discrete or continuous treatments: estimates provide the best linear approximation to the CEF when the relationship is non-linear.
    • Heterogeneous treatment effects: regression estimates are conditional-variance weighted, so they can differ from the \(ATE\) and apply only to an effective sample.
  • Beware of OVB:
    • Avoid “bad controls” that may introduce post-treatment bias or inadvertently open back-door paths.
  • Steps to take:
    1. Be explicit about the assumptions required for a causal interpretation of your regression.
    2. Use DAGs and the back-door criterion to identify covariates to control for.
    3. Make sure you know how to interpret regression coefficients.
    4. Use simulations and/or DAGitty to validate model assumptions and relationships.

Appendix

Frisch-Waugh-Lovell Theorem 🔙


  • Consider a multiple regression model: \(Y_i = \alpha + \tau T_i + X_i^\prime \beta + \nu_i\).

  • To find \(\tau\), the coefficient on \(T_i\), the Frisch-Waugh-Lovell Theorem states that:

    1. Regress \(Y_i\) on \(X_i\) and obtain the residuals \(\tilde{Y}_i = Y_i - X_i^\prime \widehat{\beta}\).

    2. Regress \(T_i\) on \(X_i\) and obtain the residuals \(\tilde{T}_i = T_i - X_i^\prime \widehat{\delta}\).

    3. Regress \(\tilde{Y}_i\) on \(\tilde{T}_i\); the resulting slope equals \(\widehat{\tau}\) from the full regression.

    Moreover, the residuals from step 3 are identical to the residuals from the full-model regression.

  • Intuition:

    • The Frisch-Waugh-Lovell theorem decomposes the estimation process.
    • Adjusts \(Y_i\) and \(T_i\) for covariates \(X_i\) separately, highlighting the direct effect of \(T_i\).
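The theorem is easy to check numerically (simulated data; the coefficients and variable names below are illustrative):

```r
# Verify FWL numerically: the coefficient on Tr from the full regression
# equals the slope from regressing residualized Y on residualized Tr,
# and the residuals of the two regressions coincide.
set.seed(42)
n <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
Tr <- 0.5 * X1 - 0.3 * X2 + rnorm(n)
Y <- 1 + 2 * Tr + X1 + X2 + rnorm(n)

full <- lm(Y ~ Tr + X1 + X2)

Y_tilde <- residuals(lm(Y ~ X1 + X2)) # step 1: residualize Y on X
T_tilde <- residuals(lm(Tr ~ X1 + X2)) # step 2: residualize Tr on X
partial <- lm(Y_tilde ~ T_tilde) # step 3: partialled-out regression

coef(full)["Tr"] - coef(partial)["T_tilde"] # zero up to numerical error
```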

References

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Aronow, Peter M., and Cyrus Samii. 2016. “Does Regression Produce Representative Estimates of Causal Effects?” American Journal of Political Science 60 (1): 250–67.
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2024. “A Crash Course in Good and Bad Controls.” Sociological Methods & Research 53 (3): 1071–1104.
Cinelli, Carlos, and Chad Hazlett. 2020. “Making Sense of Sensitivity: Extending Omitted Variable Bias.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1): 39–67.